ABSTRACT

The purpose of the current study was to use multilevel
modeling to quantify and explain the sources of score variation in
standardized patient (SP) encounters. By using laypersons trained to portray
patients and record medical student actions, SP examinations allow the measurement
of examinees' clinical and interpersonal skills. In this study, the SP test 
assesses the clinical skills of physicians about to enter supervised 
practice. Four cases were drawn from the SP bank. The number of examinees who 
saw each of these cases ranged from 357 to 565 with this number being reduced 
in the checklist (objectively scored) models to those who had already taken 
step 2 of the U.S. Medical Licensing Examination. The multilevel modeling 
software package HLM5 was used to estimate the proportion of score variation 
between SPs and training sites, assess the relationship between the skill 
scores and SP characteristics, and quantify the proportion of variation 
explained when SP characteristics are added into the model. Results of 
previous generalizability analyses have demonstrated that there is little 
variation in scores across SPs and sites, and that most of the variation in 
scores is due to case specificity. The findings from this study show that 
although SP variability was negligible for checklist scores, variability was 
quite large for interpersonal scores. Although a major advantage of using 
multilevel modeling is to explain variation at various levels, the variables 
used in this study were not helpful in explaining the variation between SPs. 
Overall, however, this study should be seen as an important first step in 
using multilevel modeling to explore the variability of SP examinations. 
Careful consideration of SP and site characteristics should be captured and 
analyzed statistically so that steps can be taken to implement fair and 
reliable examinations. (Contains 4 tables and 17 references.) (SLD) 

Standardized Patient (SP) examinations are widely used by medical schools and by testing and
certification organizations to evaluate sets of skills not readily measurable with written
multiple-choice examinations (Reznick, 2000; Whalen, 2000). By using laypersons trained to portray
patients and record student actions, these examinations allow the measurement of examinees' clinical
and interpersonal skills. Although valuable, these examinations have limitations, mainly decreased
reliability of examinee scores attributable to variation in SP portrayal and scoring and to the limited
number of cases seen by the student.

Regardless of whether SP exams are being used by a medical school for teaching 
purposes or a medical testing organization for licensure or certification, it is critical that scores 
accurately reflect the appropriate clinical skill level of the examinees. Threats to reliability may 
increase when exams are administered on a large scale and it becomes necessary to train multiple 
SPs to portray the same case across multiple testing sites. Much research has focused on 
quantifying sources of variability in SP exams since any type of unwanted variation could have a 
deleterious impact on pass/fail decisions. The conclusions of these studies, however, are mixed.

The majority of initial studies indicated that the use of multiple SPs did not cause large 
discrepancies in total test score when examinees were randomly assigned to SPs (van der Vleuten 
& Swanson, 1990). Swanson & Norcini (1989) found that raters nested within a case explained 
only 1% to 2% of the observed score variance and De Champlain et al. (1998) found that 
multiple SPs could similarly assess examinee performance, leading to identical mastery-level 
decisions for nearly all students tested. Previous research has indicated that it is not variation in 
raters but rather case content, i.e., case specificity, that contributes most to variability in a
student's scores (Klass, Fletcher, King, Durinzi, Nungester, Clauser & Ripkey, 1992; Swanson
& Norcini, 1989; Tamblyn, Klass, Schnabl & Kopelow, 1991; van der Vleuten & Swanson,
1990).

However, other studies have indicated that multiple SPs may introduce enough error to be 
consequential at the case level. Swanson and Norcini (1989) found that, although the use of 
multiple SPs playing the same case for different examinees did not affect the total test score, 
there were cases in which raters disagreed. In addition, Colliver, Robbs & Vu (1991) reported 
statistically significant differences between failure rates among SPs simulating the same cases. 
More recently, differences in intra-rater reliability (De Champlain, Macmillan, Klass & Margolis,
1999) and the effects of rater discrepancies on pass/fail decisions for heterogeneous groups have
been large enough to warrant concern (De Champlain, Gessaroli & Floreck, 2000).

Research has also shown that the use of multiple SPs and raters appears to have less 
influence on objectively scored measures such as checklists and more effect on the variability of 
interpersonal skills scores. For example, Colliver et al. (1994) found that the use of multiple 
raters on the same case decreased inter-case reliability more on measures of interpersonal and 
communication skills than checklists, total scores and written scores. Boulet et al. (1998) found
inconsistencies across holistic scoring of post-encounter notes, which supports previous research
(Colliver et al., 1994) suggesting that subjective scores are more highly influenced by individual
variability. In summary, the use of multiple SPs to portray and score a given case does not
impact all clinical scenarios in a consistent fashion. 

Although testing organizations that accommodate large volumes of examinees are
concerned with the effects of administering forms across multiple sites, fewer studies have
examined these effects. While some have found little or no difference in the scores of candidates
taking the same test administered at different sites (De Champlain, Macmillan, Klass & Margolis,
1999; Reznick et al., 1993), others have indicated that candidates' scores can be influenced by the
site at which they take an exam, especially when training offered to SPs is minimal (Tamblyn et
al., 1991; Petrusa et al., 1991). Interestingly, Tamblyn et al. (1991) reported a great deal of
variation in the reliability of individual raters, suggesting that rater characteristics may be
responsible for this variability. Unfortunately, few studies have systematically examined the
impact of rater characteristics on score variability.

It is clear that more research is needed to assess the sources of variation present when 
multiple SPs portray and score identical cases across multiple testing sites. According to De 
Champlain et al. (1998), we should not conclude that relatively small amounts of rater and site 
variation in generalizability studies are necessarily synonymous with negligible effects on 
examinee scores. Rater components are disproportionately small because the variation 
associated with case content is typically very large. Mastery-level decisions and the
rank ordering of examinees could nonetheless be affected.

One limitation of (the commonly used) generalizability theory in quantifying sources of 
score variation in SP exams is that crossed designs are more desirable than nested designs 
(Shavelson & Webb, 1991). When nesting is inherent in the design many variance components 
cannot be estimated. A further limitation is that generalizability analyses do not allow us to
address other important issues, including the characteristics that contribute to unwanted sources of
variation. Quantifying sources of score variation is helpful, but this alone cannot help test
developers reduce unwanted variation unless the actual causes of variation are known. Multi-level
modeling makes it possible not only to quantify sources of variation that would be
difficult to estimate using generalizability analysis, but also to examine how factors such as SP gender
and experience explain such variation.

The purpose of the current investigation was to use multi-level modeling to quantify and 
explain, in a more comprehensive manner, the sources of score variation in SP encounters. The 
partitioning of variance and covariance components among various levels (e.g. student, rater, test 
site) and determining the relative weight and significance of individual SP characteristics will 
provide important information to eventually implement a fair and reliable SP exam. Additional 
information on sources of score variability will enable us to make more informed decisions 
regarding scoring, calibrating and equating procedures that will ultimately enhance decision 
consistency and accuracy rates. The models selected for this investigation allow for estimation 
of rater and site effects without the large variance components related to case specificity which 
provides important feedback for case development and training activities. 

Method 

SP examination and measurement instruments 

In the present study, the SP test assesses the clinical skills (history taking, physical 
examination, communication) of physicians about to enter supervised practice. Examinees 
proceed through six to ten cases and encounter patients in a setting intended to reflect an
ambulatory clinic. After each 15-minute encounter, the SP performing the case records
the performance of students using a checklist and the Patient Perception Questionnaire
(PPQ). The checklist is tailored specifically to the standardized patient's complaint by a group of
subject matter experts and contains 10 to 25 dichotomously scored items targeting behaviors
deemed critical for success on the encounter. Unlike the checklist, the case-invariant PPQ
comprises seven 5-point Likert-type items that measure the student's interpersonal skills (IPS).
Percent correct scores are calculated for both checklist and IPS scores. The reliability
(Cronbach's alpha) of checklist and IPS scores is typically lower than that of traditional multiple-choice
examinations due to the limited number of items.
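
As a rough illustration of the two computations mentioned above, the following Python sketch computes a percent correct score and Cronbach's alpha. The function names and the toy data are hypothetical, and the percent correct formula (points earned divided by the maximum possible points) is an assumption, since the conversion of the Likert items is not described here.

```python
from statistics import pvariance

def percent_correct(item_scores, max_per_item):
    """Points earned as a percentage of the maximum possible points (assumed formula)."""
    return 100.0 * sum(item_scores) / (max_per_item * len(item_scores))

def cronbach_alpha(rows):
    """Cronbach's alpha; rows is a list of examinees, each a list of k item scores."""
    k = len(rows[0])
    item_vars = [pvariance([row[i] for row in rows]) for i in range(k)]
    total_var = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)

# Hypothetical data: a 4-item dichotomous checklist and a 7-item, 5-point PPQ.
checklist_pct = percent_correct([1, 0, 1, 1], max_per_item=1)
ppq_pct = percent_correct([1] * 7, max_per_item=5)

# Two perfectly consistent toy items, used only to exercise the formula.
alpha = cronbach_alpha([[1, 1], [0, 0], [1, 1], [0, 0]])
```

Under this assumed formula the lowest attainable PPQ score is 20%, which would be consistent with the minimum IPS score of 20 listed in Table 2. Because alpha depends on the number of items k, short instruments such as a seven-item questionnaire tend to yield lower values than long multiple-choice tests.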

One final measure, USMLE™ Step 2, was used to adjust for ability when modeling 
checklist scores. According to Bryk & Raudenbush (1992), the use of a covariate related to the 
dependent measure (in multi-level modeling) is useful because it reduces the unexplained 
variance at level-1 and increases precision of estimates at higher levels. Descriptions of
variables are provided in Table 1 and summary descriptive measures are shown in Table 2.

Methodology

Four cases were drawn from the bank. One case measured biomedical skills, another 
grave illness and two measured routine counseling skills. The second routine counseling case 
was performed by both a male and female SP whereas all other cases were performed by SPs of 
identical gender. These cases had been administered with variable frequency across testing sites 
in 2000. The number of examinees who saw each of these cases ranged from 357 to 565 with
this number being further reduced in the checklist models to those who had already taken 
USMLE Step 2. 

The multi-level modeling software package, HLM5 (Bryk, Raudenbush & Congdon, 
1994) was used to (1) estimate the proportion of score variation between SPs and training sites 
(2) assess the relationship between the skill scores (checklist or IPS scores) and SP 
characteristics and (3) quantify the proportion of variation explained when SP characteristics are 
added into the model. Each of the seven skill scores (3 checklist and 4 IPS) was modeled 
separately since prior research has reported that IPS scores are more prone to SP variation. Skill 
scores were also modeled as a function of the number of encounters performed by the SP over 
the course of the testing period and gender (solely for the second routine counseling case 
portrayed by both a male and female SP). The number of encounters served as a proxy for test 
administration experience. USMLE Step 2 was used as a covariate for the checklist models 
since it was moderately correlated with SP checklist scores. However, since the relationship with 
IPS scores was weak, these were run as intercept-only models with no covariate. 

Modeling Skill Scores in SP Examinations 

A 3-level one-way ANOVA with random effects was run to estimate variation (1) among 
students encountering a given SP (2) among SPs at a given site and (3) across sites. Predictors 
were entered in the ANOVA in a block entry fashion. First, USMLE Step 2 scores were entered 
at the student level (Level 1) to adjust for ability (for checklist scores only). The number of SP 
encounters was entered at the SP level (Level 2) to help explain variation between SPs at a given 
site. Gender was also added as a SP level predictor for the routine counseling case performed by 
both a male and a female SP. Due to the limited sample sizes, none of the models included test
site predictors (at Level 3). Random effects which were not statistically significant were 
removed from the models to reduce the number of parameters. Models with no variation among 
testing sites were reduced to 2-level models prior to adding predictors. Finally, models with little 
variation among SPs or testing sites were not modeled with predictors. More detail is provided 
below. 

Model 1: One-way ANOVA

Prior to entering predictors into the model, the one-way ANOVA with random effects 
was run to estimate the differences in means at each level. This is an important step to undertake 
because it provides a point estimate of the grand mean and is necessary to measure the variation 
explained when predictors are entered in subsequent models. Using this model, intra-class 
correlations were calculated to determine the proportion of variance in skill scores attributable to 
training site and standardized patients. The ANOVA models and those including predictors are 
provided in Table 3. The parameters are interpreted as follows: 

$\gamma_{000}$   Grand mean (Checklist or IPS);

$\beta_{00k}$   The mean in site k for SPs;

$\pi_{0jk}$   The mean for SP j in site k;

$u_{00k}$   The deviation in site k's mean from the grand mean (training site effect);

$r_{0jk}$   The deviation in SP j's mean from site k's mean (SP effect);

$e_{ijk}$   The deviation in student ijk's score from his/her SP's mean (student effect).
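
The intra-class correlations described above are ratios of variance components. A minimal sketch follows, assuming the usual three-level notation in which sigma2 is the student-level variance, tau_pi the SP-level variance and tau_beta the site-level variance. The function name and the input values are hypothetical; the inputs were chosen only so that the resulting proportions match those reported for the biomedical checklist score in Table 4, since the variance components themselves are not reported.

```python
def variance_proportions(sigma2, tau_pi, tau_beta):
    """Share of total score variance at each level of a 3-level random-effects ANOVA."""
    total = sigma2 + tau_pi + tau_beta
    return {
        "student": sigma2 / total,   # within-SP (level 1)
        "sp": tau_pi / total,        # between SPs within sites (level 2)
        "site": tau_beta / total,    # between sites (level 3)
    }

# Illustrative (not reported) variance components that sum to 100.
props = variance_proportions(sigma2=85.5, tau_pi=2.9, tau_beta=11.6)
```

With these inputs the SP share is 2.9% and the site share is 11.6%, leaving the remainder at the student level.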





Checklist Scores 

Equation 1 describes the level-1 model for checklist scores, whereby centering the
covariate around the grand mean produces an intercept adjusted for student ability.

$$ \text{CHECKLIST}_{ijk} = \pi_{0jk} + \pi_{1jk}\,(\text{Step 2}_{ijk} - \overline{\text{Step 2}}_{\cdot\cdot\cdot}) + e_{ijk} \tag{1} $$

Thus, $\pi_{0jk}$ becomes the mean checklist score for SP j in site k after adjusting for Step 2.
Similarly, the variance at the student level, that of $e_{ijk}$, is the residual variance after adjusting for Step 2.
We assume this residual variation to be independent and normally distributed. The slope
coefficient, $\pi_{1jk}$, is the expected change in checklist score (in percentage points) per one-point
increase in Step 2; its reciprocal gives the number of Step 2 score points required to increase the
checklist score by 1 percent. At this stage the level-2 and level-3 models are both unconditional
(contain no predictors).

Subsequent to adding Step 2, the number of SP encounters (#SPenc) was entered to
predict variation in SPs. This level-2 model is shown in equation 2.

$$ \pi_{0jk} = \beta_{00k} + \beta_{01k}\,(\#\text{SPenc}_{jk} - \overline{\#\text{SPenc}}_{\cdot\cdot}) + r_{0jk} \tag{2} $$

Since #SPenc is centered around the grand mean, $\beta_{01k}$ is the expected change in checklist score
per additional SP encounter for a student of average ability; its reciprocal is the number of SP
encounters required to produce a 1 point increase in the checklist score. Whereas $\pi_{0jk}$ is the Step 2-
adjusted checklist score for SP j in site k, $\beta_{00k}$ is the Step 2-adjusted checklist score in site k further
adjusted for #SPenc. Level-3 is left unconditional, i.e., we are not attempting to use predictors at
the training site level to model SP variation. This series of checklist score models is delineated
in Table 3a.
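
For readers who prefer a single-equation form, equations 1 and 2 and the unconditional level-3 model can be combined by substitution. The sketch below assumes, for simplicity, that the Step 2 and #SPenc slopes are treated as fixed (that is, $\pi_{1jk} = \gamma_{100}$ and $\beta_{01k} = \gamma_{010}$); Table 3a also lists random slope terms, which would add further components to the error structure.

```latex
\begin{aligned}
\text{CHECKLIST}_{ijk} ={}& \gamma_{000}
  + \gamma_{100}\,\bigl(\text{Step 2}_{ijk} - \overline{\text{Step 2}}\bigr)
  + \gamma_{010}\,\bigl(\#\text{SPenc}_{jk} - \overline{\#\text{SPenc}}\bigr) \\
 &+ u_{00k} + r_{0jk} + e_{ijk}
\end{aligned}
```

Here $\gamma_{000}$ is the grand mean for a student of average ability seen by a SP with average experience, and $u_{00k}$, $r_{0jk}$ and $e_{ijk}$ are the site, SP and student deviations, respectively.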

IPS scores

The IPS scores were modeled similarly to checklist scores, except that the level-1
covariate was excluded. The random intercept model including the SP predictor (#SPenc)
is provided in Table 3b (model 2). Note that the interpretation of $\beta_{01k}$ differs without the
covariate: it now reflects the expected change in IPS score per additional SP encounter (equivalently,
through its reciprocal, the number of SP encounters required to produce a 1 point
increase in IPS scores) without having conditioned on ability.

The routine counseling case performed by a male and a female SP was modeled differently
from the other cases (see Table 3c). A 3-level model was not used because many of the sites
lacked data from multiple SPs. Another difference is that checklist scores were not
modeled due to lack of Step 2 data. Unlike previous models, the gender indicator variable
(Female) was entered prior to #SPenc. Equation 3 shows the 2-level model.

$$ \beta_{0j} = \gamma_{00} + \gamma_{01}\,(\text{Female}_{j}) + \gamma_{02}\,(\#\text{SPenc}_{j} - \overline{\#\text{SPenc}}_{\cdot}) + u_{0j} \tag{3} $$

The intercept, $\gamma_{00}$, is the IPS score expected to be given by a male SP with average experience.
The gender coefficient, $\gamma_{01}$, reflects the difference in means between female and male SPs, and
$\gamma_{02}$ is the expected change in IPS score per additional SP encounter (its reciprocal is the number of
encounters required for the IPS score to increase by 1 point).

Results

Proportion of Score Variation among SPs and Training Sites (ANOVA) 

Results from the seven ANOVA models indicated that checklists were more robust to 
variation among SPs within a given site (2%-3%) whereas IPS scores varied substantially across 
SPs in a given site (26%-49%). Irrespective of the nature of the case, less than 12% of the 
variation in checklist scores was due to differences across testing sites. Checklist scores for the 
biomedical case displayed the most inter-site variability (11.6%), whereas the grave illness case 
varied the least (3%) among sites. There was no variation in interpersonal skill scores across 
testing sites with the exception of one routine counseling case, where 16% of the variation in IPS
scores was attributable to training site. The findings suggested that multilevel modeling was 
appropriate for all skill scores except the communication checklist score. ANOVA results are 
presented in Table 4. 

Checklist scores - Routine counseling encounter 

The variance components estimated in the one-way ANOVA (Table 4) show that 8.3% of
the variation in scores is due to site differences while only 2.7% is due to SP differences. After
conditioning on ability, the mean checklist scores still varied across training sites ($u_{00k}$, $\chi^2 = 29.06$,
df = 8), indicating that it is possible to explain and reduce this variation. When #SPenc was added
to the model, the conditional variance component, $u_{00k}$ (representing the variability in the grand
mean after controlling for the number of SP encounters and Step 2), was still statistically
significant ($u_{00k}$, $\chi^2 = 32.52$, df = 8). As mentioned previously, no site predictor variables were
available to model site variation in the current investigation.

Regarding SP variation, we reject the null hypothesis and conclude that there are
differences between SPs after controlling for the number of SP encounters and ability ($r_{0jk}$,
$\chi^2 = 18.79$, df = 9). Although Step 2 was statistically significant, the number of SP encounters was
not statistically related to checklist scores. The variance explained by the inclusion of this
predictor (23%) is misleading because there is little variation among SPs to begin with.
Consequently, a reduction by 23% amounts to less than 1% of the variation in total scores. The
deviance statistics indicate that the addition of #SPenc is not justified over the more
parsimonious model containing only Step 2 ($\chi^2 = 0.94$, df = 1). Due to the unique modifications
made to each of the models, random coefficients are not provided in Table 4.
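
The "less than 1%" figure follows from multiplying the two reported percentages. The short Python sketch below reproduces that arithmetic; the function name is hypothetical, and the inputs are the 2.7% SP share and the 23% reduction reported for this case.

```python
def share_of_total_explained(level_share, explained_fraction):
    """Portion of total score variance removed when a predictor explains
    a given fraction of the variance located at one level."""
    return level_share * explained_fraction

# 2.7% of total variance lies between SPs; #SPenc explains 23% of that share.
explained_of_total = share_of_total_explained(0.027, 0.23)
```

The product is roughly 0.6% of the total variation, which is why a seemingly large reduction at the SP level is of little practical consequence here.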

Checklist scores - Biomedical encounter 

The results of the 3-level ANOVA in Table 4 show that approximately 11.6% of the
variation in checklist scores was due to training site differences and that only 2.9% was due to
differences among SPs. This model indicates that, after adjusting for ability, students who
encounter a SP with average experience (#SPenc) will have a checklist score of approximately
53%. Their score is estimated to increase by 1 for every 6 points they achieve on the Step 2
examination (over and above the mean Step 2 score for this sample). Similar to findings reported
for the routine counseling case, #SPenc was not significantly related to checklist score, most
likely due to the small proportion of variation left to explain. The addition of #SPenc is therefore
not justifiable (difference in deviance between the Step 2 model and the #SPenc model = 0.10 with 1
degree of freedom). The estimated reliability of the intercept is moderately high in these models
(.72 to .77), suggesting that these data provide an adequate level of precision in estimating $\beta_{00k}$ from
the current sample.
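
To make the interpretation of the fixed effects concrete, the sketch below evaluates the fitted line using the intercept (53.0) and Step 2 coefficient (.16) listed in Table 4. The function names are hypothetical, and using the Step 2 sample mean from Table 2 (214.52) as the centering constant is an assumption.

```python
def predicted_checklist(step2, step2_mean=214.52, intercept=53.0, slope=0.16):
    """Expected checklist score (%) for a student seen by a SP with average
    experience, given a grand-mean-centered Step 2 covariate."""
    return intercept + slope * (step2 - step2_mean)

def step2_points_per_checklist_point(slope=0.16):
    """Step 2 points needed for a one-point gain in checklist score."""
    return 1.0 / slope

at_mean = predicted_checklist(214.52)          # student at the Step 2 mean
six_above = predicted_checklist(220.52)        # student 6 points above the mean
points_needed = step2_points_per_checklist_point()
```

The reciprocal of the slope is 6.25, which corresponds to the "1 for every 6 points" statement in the text.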

IPS scores - Biomedical Encounter 

SP predictors were added to a 2-level model because there was no variation across
training sites. The number of SP encounters, the proxy for test administration experience, was not
significantly related ($\gamma_{01} = -.05$, t = -0.40) to IPS score variation. Consequently, modeling IPS
scores using this SP characteristic did not explain variation among SPs. The significance of the
SP effect ($u_{0j}$, $\chi^2 = 380.19$, df = 25) indicates that after adding this predictor, the mean IPS
scores still varied around the grand mean. Had the $\chi^2$ statistic not been statistically significant,
we would have concluded that there was no variation among SP means. However, we rejected the null
hypothesis of homogeneity.

The difference in deviance statistics between model 1 and model 2 was distributed
approximately as $\chi^2 = 0.00$ with 1 degree of freedom, which indicates that the addition of #SPenc
was not justified. In summary, the ANOVA model was most useful in describing the variation
in these scores: 26.5% at the SP level and 0% at the training site level. Other SP characteristics
are necessary to decompose sources of variation in IPS scores for biomedical encounters.
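
The deviance comparisons reported for these models can be checked by hand, since a difference in deviance between nested models that differ by one parameter is referred to a chi-square distribution with 1 degree of freedom. The sketch below uses the identity that, for 1 degree of freedom, the upper-tail probability equals erfc(sqrt(x / 2)); the function name is hypothetical and the p-values are illustrative rather than output from HLM5.

```python
import math

def chi2_df1_pvalue(deviance_difference):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(deviance_difference / 2.0))

# Deviance differences reported in the text for adding #SPenc.
p_routine_checklist = chi2_df1_pvalue(0.94)
p_biomedical_checklist = chi2_df1_pvalue(0.10)
p_biomedical_ips = chi2_df1_pvalue(0.00)
```

All three differences fall well short of the 3.84 critical value at the .05 level, in line with the conclusion that adding #SPenc was not justified.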

IPS scores - Grave illness encounter 

The results for this series of analyses mirror those reported above. Again, the number of
SP encounters was not useful ($\gamma_{01} = -0.04$, t = -0.24) in explaining differences among SPs. This
model did not explain any of the variance at the SP level (44.8%). Consequently, there remains
a great deal of unexplained variation ($u_{0j}$, $\chi^2 = 376.30$, df = 21). The addition of #SPenc is not
justified over the 3-level ANOVA.

IPS Scores - Routine Counseling 1

The results provided in Table 4 indicate that 27.1% of the variation in IPS scores was 
related to SP differences. Similar to the biomedical encounter, a 3-level model was not 
appropriate since IPS scores did not vary across testing sites. Adding the number of encounters 
(#SPenc) into the 2-level model explained 56.4% of this variation among SPs although this 
predictor was not statistically related to IPS scores. The results of these scores differ from 
previous findings but still do not indicate that test administration experience plays a role in score 
variability. 

IPS Scores - Routine Counseling 2 (male & female SP) 

IPS scores for the routine counseling case portrayed by both a male and a female SP were also
not influenced by test administration experience. In fact, the results in Table 4 show that
neither gender ($\gamma_{01} = 4.08$, t = 0.72) nor #SPenc ($\gamma_{02} = 0.13$, t = 0.82) was statistically related to
IPS. Adding these predictors decreased the intercept from 78.12 (ANOVA) to 75.88 (Female)
and finally to 72.69 (Female & #SPenc). However, the variation in $\gamma_{00}$ decreased by only 3.1% in
Model 2 and 7.6% in the final model. Neither model was justified.
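
The dummy coding of gender can be illustrated by evaluating the final model at a few values. The sketch below plugs in the reported estimates (intercept 72.69, Female 4.08, #SPenc 0.13); the function name is hypothetical, the centering constant of 22.00 encounters is taken from Table 2 as an assumption, and because neither coefficient was statistically significant the output is an illustration of the coding scheme rather than a substantive prediction.

```python
def predicted_ips(female, n_encounters, mean_encounters=22.0,
                  intercept=72.69, b_female=4.08, b_enc=0.13):
    """Expected IPS score (%) given SP gender (1 = female, 0 = male) and
    the SP's number of encounters, centered at the assumed grand mean."""
    return intercept + b_female * female + b_enc * (n_encounters - mean_encounters)

male_average = predicted_ips(female=0, n_encounters=22.0)
female_average = predicted_ips(female=1, n_encounters=22.0)
male_ten_more = predicted_ips(female=0, n_encounters=32.0)
```

With Female coded 1, the intercept applies to male SPs with average experience, and the Female coefficient is added only for female SPs.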

Discussion 

This investigation has provided important information for the consortia that develop SP 
tests to be administered across multiple testing sites and within a given site. Results of previous 
generalizability analyses have demonstrated that there is little variation in scores across SPs and 
sites, and that most of the variation in scores is due to case specificity, i.e. the nature of the 
complaint. The findings from this study show us that although SP variability was negligible for 
checklist scores, variability was quite large for interpersonal skills scores. This result is
supported by previous research indicating that interpersonal skills scores are more influenced by
variations among SPs. Based on this finding it is important to consider how these instruments
are developed and to do so in such a way that limits rater subjectivity. For example, asking the
rater to record how he or she 'felt' may not be adequate. Rather, videotaped encounters that
establish baselines of interpersonal skill levels may be more effective.

One result not anticipated based on past research was the systematic SP 
stringency/leniency across training sites for checklist scores of certain cases. After adjusting for 
ability, several checklist scores varied by as much as 10% as a function of the testing site. This 
inter-site variability in checklist scores could have resulted from differences in training across 
sites or possibly because guidelines to scoring checklists were not clear (i.e. what a student must 
do to receive credit for completing a given behavior). This might also be due to the various 
levels of adherence to protocols on the part of different trainers. Although past research has 
shown the effect of testing site to be minimal at the overall test level, this study underscores the 
importance of looking at cases on an individual basis. If systematic stringency or leniency
among SPs or sites is detected during pre-testing, steps could be taken prior to live administration
to remedy the situation. Unlike checklist scores, IPS scores did not differ as a function of testing
site, with the exception of one routine counseling case.

Although a major advantage of using multilevel modeling is to explain variation at 
various levels, the variables utilized in the current investigation were not helpful in explaining 
the variation between SPs. The benefit of adding predictors into multilevel models diminishes if 
there is not a substantial amount of variation left to model (generally about 10%). Therefore,
based on the negligible amount of SP variability in the checklist scores it is doubtful that test 
administration experience or any predictor, for that matter, could have explained this variation. It 
was surprising, however, that the number of encounters performed by the SP had little impact in 
explaining variation in IPS scores. The aggregated nature of this variable may have contributed 
to the lack of significance. It is also possible that results might have differed had an adequate 
covariate been used to adjust for ability on the IPS scores. 

This preliminary study has several limitations that must be addressed. As mentioned 
above, the results from IPS models must be interpreted cautiously because the model lacked an 
adequate covariate. However, in a 3-level model the proportion of variation among SPs is 
associated with those performing at a given site. Under the assumption that students at a given 
school are of the same relative ability, we would not anticipate the proportion of variation 
estimated in the ANOVA to be so large. A second limitation is related to the limited nature of 
the data set. A more expansive data set containing extensive student, SP and site related 
information would be desirable in order to explore additional sources of SP and site variation. A 
final limitation is that for the most part, only one case was selected to represent the nature of 
encounters. It is important to note that using HLM at the case level precludes the estimation of 
SP or site variation across the entire examination (as is estimated using generalizability analysis). 
The methodology described in this study is more desirable for pre-testing cases than estimating 
variance proportions at the exam level. Since methodology should be selected to answer the 
research question at hand, this is not seen as a limitation. 

Unlike generalizability analysis, HLM models are able to distinguish which factors of SPs
or sites may be responsible for score variation. This information is critical to large scale testing 
organizations that are considering selection of equating, calibrating and scoring procedures. If 
the sources of score variation are known, then steps can be taken a priori to reduce this variation 
and scoring and calibrating routines may be developed to compensate for variability in the 
measures. 

Overall, this study should be viewed as an important first step in utilizing multi-level 
modeling to explore the variability of SP examinations. Some results from this investigation 
confirmed those found in previous research whereas other findings offered a different 
perspective. Given the high stakes involved in a licensing examination, it appears unwise to 
adopt the attitude that stringency and leniency of SPs “wash out” over the course of the (multi- 
case) examination. Given the opportunity, it is advisable to take actions prior to live 
administration to reduce the variation in individual cases. In addition to quantifying the score 
variation, HLM models inform us as to the causes of variation inherent to clinical skills 
encounters. Careful consideration of SP and site characteristics should be captured and analyzed 
statistically so that steps can be taken to develop and implement fair and reliable examinations. 

TABLE 1: Variable Names, Descriptions, Scales

NAME        DESCRIPTION                                                 SCALE

LEVEL-1
CHECKLST    Instrument measures a composite of History Taking,          Percent Correct (1-100%)
            Communication and Physical Exam Skills
IPS         Patient Perception Questionnaire of Interpersonal Skills    Percent Correct (1-100%)
STEP 2      USMLE Step 2 score                                          Scale has mean of 220 & s.d. of 22

LEVEL-2
#SPENC      Number of encounters performed by the SP                    Interval
FEMALE      Dummy coded predictor variable for gender                   Females are coded 1

LEVEL-3
NONE


TABLE 2: Descriptive Statistics

Encounter                 N      Mean     Standard    Minimum   Maximum
                                          Deviation

Grave Illness
  Checklist               519    76.22    14.55       22        100
  Interpersonal Skills    526    75.60    16.09       20        100
  #SPenc                  23     22.87    14.33       5         64

Routine Counseling (#1)
  Checklist               538    70.42    12.40       29        100
  Interpersonal Skills    543    76.50    13.25       34        100
  Step 2                  171    215.83   19.86       166       269
  #SPenc                  19     16.89    7.45        5         30

Routine Counseling (#2)
  Checklist               327    62.24    15.64       8         100
  Interpersonal Skills    330    79.66    15.79       31        100
  #SPenc                  15     22.00    17.21       8         67
  Female                  15     0.60     0.51        0         1

Biomedical
  Checklist               565    50.93    14.54       14        86
  Interpersonal Skills    580    79.30    12.87       37        100
  Step 2                  190    214.52   23.59       141       270
  #SPenc                  19     22.74    16.14       8         75

Table 3a: 3-Level Hierarchical Models for Checklist Data

Model       Level   Equations

1 (ANOVA)   1       $\text{CHECKLIST} = \pi_{0jk} + e_{ijk}$
            2       $\pi_{0jk} = \beta_{00k} + r_{0jk}$
            3       $\beta_{00k} = \gamma_{000} + u_{00k}$

2           1       $\text{CHECKLIST} = \pi_{0jk} + \pi_{1jk}(\textit{Step 2}) + e_{ijk}$
            2       $\pi_{0jk} = \beta_{00k} + r_{0jk}$
                    $\pi_{1jk} = \beta_{10k} + r_{1jk}$
            3       $\beta_{00k} = \gamma_{000} + u_{00k}$
                    $\beta_{10k} = \gamma_{100} + u_{10k}$

3           1       $\text{CHECKLIST} = \pi_{0jk} + \pi_{1jk}(\textit{Step 2}) + e_{ijk}$
            2       $\pi_{0jk} = \beta_{00k} + \beta_{01k}(\textit{\#SPenc}) + r_{0jk}$
                    $\pi_{1jk} = \beta_{10k} + r_{1jk}$
            3       $\beta_{00k} = \gamma_{000} + u_{00k}$
                    $\beta_{01k} = \gamma_{010} + u_{01k}$
                    $\beta_{10k} = \gamma_{100} + u_{10k}$

* Italicized variables are centered around the grand mean.

Table 3b: 3-Level Random-Intercept Models for IPS Data

Model       Level   Equations

1 (ANOVA)   1       $\text{IPS} = \pi_{0jk} + e_{ijk}$
            2       $\pi_{0jk} = \beta_{00k} + r_{0jk}$
            3       $\beta_{00k} = \gamma_{000} + u_{00k}$

2           1       $\text{IPS} = \pi_{0jk} + e_{ijk}$
            2       $\pi_{0jk} = \beta_{00k} + \beta_{01k}(\textit{\#SPenc}) + r_{0jk}$
            3       $\beta_{00k} = \gamma_{000} + u_{00k}$
                    $\beta_{01k} = \gamma_{010} + u_{01k}$

* Italicized variables are centered around the grand mean.
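
Table 3c: 2-Level Random-Intercept Models for IPS Data

Model       Level   Equations

1 (ANOVA)   1       $\text{IPS} = \beta_{0j} + r_{ij}$
            2       $\beta_{0j} = \gamma_{00} + u_{0j}$

2           1       $\text{IPS} = \beta_{0j} + r_{ij}$
            2       $\beta_{0j} = \gamma_{00} + \gamma_{01}(\text{Female}) + u_{0j}$

3           1       $\text{IPS} = \beta_{0j} + r_{ij}$
            2       $\beta_{0j} = \gamma_{00} + \gamma_{01}(\text{Female}) + \gamma_{02}(\#\text{SPenc}) + u_{0j}$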

Table 4. Variance Proportions Estimated in the One-Way ANOVA with Random Effects Model. Effect
Sizes and Variation Explained for the Final Model (Including SP Predictors)

                        Variance Proportions   Final Model
                        in ANOVA
Skill Score/Case        SP        Site         Intcpt   STEP 2   #SpEnc   Female   Variance explained by predictors

Checklist Scores
  Biomedical            2.9%      11.6%        53.0*    .16*     -.03              6.7%
  Routine Counseling 1  2.7%      8.3%         71.7*    .09*     .21               23.0%
  Grave Illness         2.0%      3.0%         Predictors not added due to small variance proportions

IPS Scores
  Biomedical            26.5%     0.0%         79.7*             -.05              0.0%
  Routine Counseling 1  27.1%     15.7%        78.1*             .01               56.4%
  Routine Counseling 2  49.4%     n/a          72.7*             .13      4.08     7.6%
  Grave Illness         44.8%     0.0%         76.0*             -.04              0.0%

*Effect is statistically significant at the .05 level.